Goto

Collaborating Authors

 Translational Bioinformatics


Appendix ProteinShake: Building datasets and benchmarks for deep learning on protein structures

Neural Information Processing Systems

Table 3: Comparison of models trained with different representations of protein structure across various tasks, on a random data split . The optimal choice of representation depends on the task. Shown are mean and standard deviation across four runs with different seeds. Table 4: Comparison of models trained with different representations of protein structure across various tasks, on a sequence data split . Table 5: Comparison of models trained with different representations of protein structure across various tasks, on a structure data split .




Multi-modal Transfer Learning between Biological Foundation Models

Neural Information Processing Systems

Modeling these sequences is key to understand disease mechanisms and is an active research area in computational biology. Recently, Large Language Models have shown great promise in solving certain biological tasks but current approaches are limited to a single sequence modality (DNA, RNA, or protein). Key problems in genomics intrinsically involve multiple modalities, but it remains unclear how to adapt general-purpose sequence models to those cases. In this work we propose a multi-modal model that connects DNA, RNA, and proteins by leveraging information from different pre-trained modality-specific encoders. We demonstrate its capabilities by applying it to the largely unsolved problem of predicting how multiple \rna transcript isoforms originate from the same gene (i.e.


Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences

Ali, Sarwan, Murad, Taslim

arXiv.org Artificial Intelligence

Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. The global availability of large-scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today's multi-million-sequence datasets. Similarly, current embedding-based techniques often rely on aligned sequences or exhibit suboptimal predictive performance and high runtime costs, creating barriers to practical large-scale analysis. In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional representations of spike sequences. These embeddings are subsequently used to train a variety of machine learning models for supervised lineage classification. We conduct an extensive evaluation comparing our approach with multiple baseline and state-of-the-art biological sequence embedding methods across diverse metrics. Our results demonstrate that the proposed embeddings offer substantial improvements in efficiency, achieving up to 86.4\% classification accuracy while reducing embedding generation time by as much as 99.81\%. This highlights the method's potential as a fast, effective, and scalable solution for large-scale viral sequence analysis.


Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction

Pourmirzaei, Mahdi, Esmaili, Farzaneh, Alqarghuli, Salhuldin, Pourmirzaei, Mohammadreza, Han, Ye, Chen, Kai, Rezaei, Mohsen, Wang, Duolin, Xu, Dong

arXiv.org Artificial Intelligence

The diverse nature of protein prediction tasks has traditionally necessitated specialized models, hindering the development of broadly applicable and computationally efficient Protein Language Models (PLMs). In this work, we introduce Prot2Token, a unified framework that overcomes these challenges by converting a wide spectrum of protein-related predictions-from sequence-level properties and residue-specific attributes to complex inter-protein interactions-into a standardized next-token prediction format. At its core, Prot2Token employs an autoregressive decoder, conditioned on embeddings from pre-trained protein encoders and guided by learnable task tokens, to perform diverse predictions. This architecture uniquely facilitates multi-task learning, enabling general-purpose decoders to generalize across five distinct categories. We present extensive experimental validation across a variety of benchmarks, demonstrating Prot2Token's predictive power in different types of protein-prediction tasks. In 3D structure prediction, Prot2Token delivers substantial speedups (up to 1000x faster than AlphaFold2 with MSA on the same hardware) while, across other numerous tasks, matching or surpassing specialized methods. Beyond that, we introduce an auxiliary self-supervised decoder pre-training approach to improve spatially sensitive task performance. Prot2Token thus offers a step towards standardizing biological prediction into a generative interface, promising to accelerate biological discovery and the development of novel therapeutics. The code is available at https://github.com/mahdip72/prot2token .


STAR-GO: Improving Protein Function Prediction by Learning to Hierarchically Integrate Ontology-Informed Semantic Embeddings

Akça, Mehmet Efe, Uludoğan, Gökçe, Özgür, Arzucan, Baytaş, İnci M.

arXiv.org Artificial Intelligence

Accurate prediction of protein function is essential for elucidating molecular mechanisms and advancing biological and therapeutic discovery. Yet experimental annotation lags far behind the rapid growth of protein sequence data. Computational approaches address this gap by associating proteins with Gene Ontology (GO) terms, which encode functional knowledge through hierarchical relations and textual definitions. However, existing models often emphasize one modality over the other, limiting their ability to generalize, particularly to unseen or newly introduced GO terms that frequently arise as the ontology evolves, and making the previously trained models outdated. We present STAR-GO, a Transformer-based framework that jointly models the semantic and structural characteristics of GO terms to enhance zero-shot protein function prediction. STAR-GO integrates textual definitions with ontology graph structure to learn unified GO representations, which are processed in hierarchical order to propagate information from general to specific terms. These representations are then aligned with protein sequence embeddings to capture sequence-function relationships. STAR-GO achieves state-of-the-art performance and superior zero-shot generalization, demonstrating the utility of integrating semantics and structure for robust and adaptable protein function prediction. Code is available at https://github.com/boun-tabi-lifelu/stargo.


ProteinPNet: Prototypical Part Networks for Concept Learning in Spatial Proteomics

McConnell, Louis, Sun, Jieran, Maffei, Theo, Gottardo, Raphael, Rapsomaniki, Marianna

arXiv.org Artificial Intelligence

Understanding the spatial architecture of the tumor microenvironment (TME) is critical to advance precision oncology. We present ProteinPNet, a novel framework based on prototypical part networks that discovers TME motifs from spatial proteomics data. Unlike traditional post-hoc explanability models, ProteinPNet directly learns discriminative, interpretable, faithful spatial prototypes through supervised training. We validate our approach on synthetic datasets with ground truth motifs, and further test it on a real-world lung cancer spatial proteomics dataset. ProteinPNet consistently identifies biologically meaningful prototypes aligned with different tumor subtypes. Through graphical and morphological analyses, we show that these prototypes capture interpretable features pointing to differences in immune infiltration and tissue modularity. Our results highlight the potential of prototype-based learning to reveal interpretable spatial biomarkers within the TME, with implications for mechanistic discovery in spatial omics.


Beyond Atoms: Evaluating Electron Density Representation for 3D Molecular Learning

Suriana, Patricia, Rackers, Joshua A., Nowara, Ewa M., Pinheiro, Pedro O., Nicoloudis, John M., Sresht, Vishnu

arXiv.org Artificial Intelligence

Machine learning models for 3D molecular property prediction typically rely on atom-based representations, which may overlook subtle physical information. Electron density maps -- the direct output of X-ray crystallography and cryo-electron microscopy -- offer a continuous, physically grounded alternative. We compare three voxel-based input types for 3D convolutional neural networks (CNNs): atom types, raw electron density, and density gradient magnitude, across two molecular tasks -- protein-ligand binding affinity prediction (PDBbind) and quantum property prediction (QM9). We focus on voxel-based CNNs because electron density is inherently volumetric, and voxel grids provide the most natural representation for both experimental and computed densities. On PDBbind, all representations perform similarly with full data, but in low-data regimes, density-based inputs outperform atom types, while a shape-based baseline performs comparably -- suggesting that spatial occupancy dominates this task. On QM9, where labels are derived from Density Functional Theory (DFT) but input densities from a lower-level method (XTB), density-based inputs still outperform atom-based ones at scale, reflecting the rich structural and electronic information encoded in density. Overall, these results highlight the task- and regime-dependent strengths of density-derived inputs, improving data efficiency in affinity prediction and accuracy in quantum property modeling.